arXiv:1509.00106v5 [math.OC] 3 Jul 2016




Adaptive Smoothing Algorithms for Nonsmooth 
Composite Convex Minimization 

Quoc Tran-Dinh 



Abstract We propose an adaptive smoothing algorithm based on Nesterov’s smoothing technique in [24] for solving “fully” nonsmooth composite convex optimization problems. Our method combines Nesterov’s accelerated proximal gradient scheme with a new homotopy strategy for the smoothness parameter. By an appropriate choice of smoothing functions, we develop a new algorithm that has the $O(1/\varepsilon)$ worst-case iteration-complexity while preserving the same complexity-per-iteration as Nesterov’s method, and that allows one to automatically update the smoothness parameter at each iteration. Then, we customize our algorithm to solve four special cases that cover various applications. We also specify our algorithm to solve constrained convex optimization problems and show its convergence guarantee on a primal sequence of iterates. We demonstrate our algorithm through three numerical examples and compare it with other related algorithms.

Keywords Nesterov’s smoothing technique • accelerated proximal-gradient method • adaptive algorithm • composite convex minimization • nonsmooth

1 Introduction

This paper develops new smoothing optimization methods for solving the following “fully” nonsmooth composite convex minimization problem:

$$F^* := \min_{x \in \mathbb{R}^p} \{ F(x) := f(x) + g(x) \}, \tag{1}$$

where $g : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ is a proper, closed and convex function, and $f : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ is a convex function defined by the following max-structure:

$$f(x) := \max_{u \in \mathbb{R}^n} \{ \langle x, Au \rangle - \varphi(u) : u \in \mathcal{U} \}. \tag{2}$$

Here, $\varphi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is a proper, closed and convex function, $\mathcal{U}$ is a nonempty, closed, and convex set in $\mathbb{R}^n$, and $A \in \mathbb{R}^{p \times n}$ is given.

Clearly, any proper, closed and convex function $f$ can be written as (2) using its Fenchel conjugate $f^*$, i.e., $f(x) := \sup\{\langle x, u \rangle - f^*(u) : u \in \mathrm{dom}(f^*)\}$. Hence, the max-structure (2) does not restrict the applicability of the template (1). Moreover, (1) also directly models many practical applications in signal and image processing, machine learning, statistics and data sciences, see, e.g.,
[iiniiiiiiiiiiaiisiisi] and the references quoted therein. 

While the first term $f$ is nonsmooth, the second term $g$ remains unspecified. On the one hand, we can assume that $g$ is smooth and its gradient is Lipschitz continuous. On the other hand, $g$ can be nonsmooth, but it is equipped with a “tractable” proximity operator defined as follows: $g$ is said to be tractably proximal if its proximal operator

$$\mathrm{prox}_g(x) := \mathrm{argmin}_{y} \{ g(y) + (1/2)\|y - x\|^2 : y \in \mathrm{dom}(g) \}, \tag{3}$$

can be computed “efficiently” (e.g., in a closed form or by polynomial time algorithms). In general, computing $\mathrm{prox}_g$ requires solving the strongly convex problem (3), but in many cases, this operator can be obtained in a closed form or by a low-cost polynomial algorithm. Examples of such convex functions can
be found in the literature including H1IT3I1H]. 
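
As a concrete instance of a tractably proximal function, the following minimal sketch evaluates (3) for $g = \lambda\|\cdot\|_1$, whose proximal operator is the well-known soft-thresholding map. The names `prox_l1`, `prox_objective` and the weight `lam` are illustrative choices and do not come from the paper.

```python
def prox_l1(x, lam):
    """Proximal operator of g(y) = lam * ||y||_1 at x (soft-thresholding)."""
    return [max(abs(xi) - lam, 0.0) * (1.0 if xi > 0 else -1.0) for xi in x]


def prox_objective(y, x, lam):
    """Objective of (3) with g = lam * ||.||_1: g(y) + (1/2) * ||y - x||^2."""
    g_val = lam * sum(abs(yi) for yi in y)
    quad = 0.5 * sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return g_val + quad


# Small example: entries with |x_i| <= lam are set to zero, the rest shrink by lam.
x = [3.0, -0.5, 1.2]
y_star = prox_l1(x, 1.0)
```

Each coordinate is handled independently, so the cost is linear in the dimension; this is the kind of closed-form evaluation the text calls “efficient”.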

Solving nonsmooth convex optimization problems remains challenging, especially when neither of the two nonsmooth terms $f$ and $g$ is equipped with a tractable proximity operator. Existing nonsmooth convex optimization approaches such as subgradient-type descent algorithms, dual averaging strategies, bundle-level techniques or derivative-free methods are often used to solve general nonsmooth convex problems. However, these methods suffer from a slow convergence rate, namely the $O(1/\varepsilon^2)$ worst-case iteration-complexity. In addition,
they are sensitive to the algorithmic parameters such as stepsizes |Hj. 

In his pioneering work [24], Nesterov showed that one can solve the nonsmooth structured convex minimization problem (1) within $O(1/\varepsilon)$ iterations. This method combines a proximity smoothing technique and Nesterov’s accelerated gradient scheme [2T] to achieve the optimal worst-case iteration-complexity, which is much better than the $O(1/\varepsilon^2)$ worst-case iteration-complexity of nonsmooth optimization methods.

Motivated by [23], Nesterov and many other researchers have proposed different algorithms using such a proximity smoothing method to solve other problems, to improve Nesterov’s original algorithm, or to customize his algorithm
to specific applications, see, e.g., [^[SirfinSl[T7irmE{?l[^l^[55] . in |g. Beck 
and Teboulle generalized Nesterov’s smoothing technique to a generic frame¬ 
work, where they discussed the advantages and disadvantages of smoothing 
techniques. In addition, they also illustrated the numerical efficiency between 
smoothing techniques and proximal-type methods. In BET], the authors studied smoothing techniques for the sum of three convex functions, where one term has a Lipschitz gradient, while the others are nonsmooth. In m, a variable smoothing method was proposed, which possesses the $O(\ln(k)/k)$ convergence rate. This convergence rate is worse than the one in [24]. However, as a compensation, the smoothness parameter is updated at each iteration. In addition, their method uses special quadratic proximity functions, and smooths both $f$ and $g$ under their Lipschitz continuity assumption.

In [53], Nesterov introduced an excessive gap technique, which requires both primal and dual schemes using two smoothness parameters. It symmetrically updates one parameter at each iteration. Nevertheless, this method uses different assumptions than our method. Other primal-dual methods studied in, e.g., (nide] use double smoothing techniques to solve (1), but only achieve the $O((1/\varepsilon)\log(1/\varepsilon))$ worst-case iteration-complexity.

Our approach in this paper is also based on Nesterov’s smoothing technique 
in [21]. To clarify the differences between our method and [HIM], let us first 
briefly present Nesterov’s smoothing technique in [24] applied to (1).

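Recall that a convex function $b_{\mathcal{U}} : \mathbb{R}^n \to \mathbb{R}$ is a proximity function of $\mathcal{U}$ if it is continuous, and strongly convex with the convexity parameter $\mu_b > 0$ and $\mathcal{U} \subseteq \mathrm{dom}(b_{\mathcal{U}})$. We define

$$\bar{u}^c := \mathrm{argmin}_{u} \{ b_{\mathcal{U}}(u) : u \in \mathcal{U} \} \quad \text{and} \quad D_{\mathcal{U}} := \sup_{u} \{ b_{\mathcal{U}}(u) : u \in \mathcal{U} \} \in [0, +\infty).$$

Here, $\bar{u}^c$ and $D_{\mathcal{U}}$ are called the prox-center and prox-diameter of $\mathcal{U}$ w.r.t. $b_{\mathcal{U}}$, respectively. Without loss of generality, we can assume that $b_{\mathcal{U}}(\bar{u}^c) = 0$ and $\mu_b = 1$. Otherwise, we just rescale and shift it.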

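As shown in [H], given $\gamma > 0$ and $b_{\mathcal{U}}$, we can approximate $f$ by $f_\gamma$ as

$$f_\gamma(x) := \max_{u} \{ \langle x, Au \rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) : u \in \mathcal{U} \}, \tag{4}$$

where $\gamma$ is called a smoothness parameter. Since $f_\gamma$ is smooth and has Lipschitz gradient, one can apply accelerated proximal gradient methods wm to minimize the sum $f_\gamma(\cdot) + g(\cdot)$. Using such methods, we can eventually guarantee

$$F(x^k) - F^* \le \min_{\gamma > 0} \left\{ \frac{2\|A\|^2 R_0^2}{\gamma (k+1)^2} + \gamma D_{\mathcal{U}} \right\} = \frac{2\sqrt{2}\,\|A\| R_0 \sqrt{D_{\mathcal{U}}}}{k+1}, \tag{5}$$

where $\{x^k\}$ is the underlying sequence generated by the accelerated proximal-gradient method, see [H], and $R_0 := \|x^0 - x^*\|$. To achieve an $\varepsilon$-solution $x^k$ such that $F(x^k) - F^* \le \varepsilon$, we set $\gamma \equiv \gamma^* := \varepsilon/(2D_{\mathcal{U}})$ at its optimal value. Hence, the algorithm requires at most $k_{\max} := \lfloor 2\sqrt{2}\,\|A\| R_0 \sqrt{D_{\mathcal{U}}}/\varepsilon \rfloor$ iterations.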
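
The optimal smoothness parameter behind (5) can be checked by hand. The short derivation below is a sketch that assumes the bound has the form $a/\gamma + d\gamma$ with $a := 2\|A\|^2 R_0^2/(k+1)^2$ and $d := D_{\mathcal{U}}$; the symbols $a$, $d$ and $h$ are introduced only for this illustration.

```latex
h(\gamma) := \frac{a}{\gamma} + d\,\gamma, \qquad
h'(\gamma) = -\frac{a}{\gamma^{2}} + d = 0
\;\Longrightarrow\; \gamma = \sqrt{a/d}, \qquad
h\big(\sqrt{a/d}\big) = 2\sqrt{a\,d}.
% Substituting a = 2\|A\|^2 R_0^2/(k+1)^2 and d = D_{\mathcal{U}}:
\min_{\gamma > 0} h(\gamma) = \frac{2\sqrt{2}\,\|A\|\,R_0\sqrt{D_{\mathcal{U}}}}{k+1}.
% For a target accuracy \varepsilon, balance the two terms at \varepsilon/2 each:
d\,\gamma = \frac{\varepsilon}{2}
\;\Longrightarrow\; \gamma^{*} = \frac{\varepsilon}{2 D_{\mathcal{U}}},
\qquad
\frac{a}{\gamma^{*}} \le \frac{\varepsilon}{2}
\;\Longleftrightarrow\;
k + 1 \ge \frac{2\sqrt{2}\,\|A\|\,R_0\sqrt{D_{\mathcal{U}}}}{\varepsilon}.
```

The last line makes explicit why a small $\varepsilon$ or a large $D_{\mathcal{U}}$ forces a small $\gamma^*$.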

Our approach: The original smoothing algorithm in [24] has three computational disadvantages even with the optimal choice $\gamma^* := \varepsilon/(2D_{\mathcal{U}})$ of $\gamma$.

(a) It requires the prox-diameter $D_{\mathcal{U}}$ of $\mathcal{U}$ to determine $\gamma^*$, which may be expensive to estimate when $\mathcal{U}$ is complicated.

(b) If $\varepsilon$ is small and $D_{\mathcal{U}}$ is large, then $\gamma^*$ is small, and hence, the strong convexity parameter of (4) is small. Algorithms for solving (4) have slow convergence speed.

(c) The Lipschitz constant of $\nabla f_\gamma$ is $\gamma^{-1}\|A\|^2 = 2\|A\|^2 D_{\mathcal{U}} \varepsilon^{-1}$, which is large. This leads to a small step-size of $\varepsilon/(2\|A\|^2 D_{\mathcal{U}})$ in the accelerated proximal-gradient algorithm and hence, can lead to slow convergence.

Our approach is briefly presented as follows. We first choose a smooth proximity function $b_{\mathcal{U}}$ instead of a general one. We assume that $\nabla b_{\mathcal{U}}$ is $L_b$-Lipschitz continuous with the Lipschitz constant $L_b \ge \mu_b = 1$. Then, we define $f_\gamma(x)$ as in (4), which is a smoothed approximation to $f$ as above.

We design a smoothing accelerated proximal-gradient algorithm that can update $\gamma$ from $\gamma_k$ to $\gamma_{k+1}$ at each iteration so that $\gamma_{k+1} < \gamma_k$ by performing only one accelerated proximal-gradient step mm to minimize the sum $F_{\gamma_{k+1}} := f_{\gamma_{k+1}} + g$ for each value $\gamma_{k+1}$ of $\gamma$. We prove that the sequence of the objective residuals, $\{F(x^k) - F^*\}$, converges to zero up to the $O(1/k)$-rate.

Our contributions: Our main contributions can be summarized as follows:

(a) We propose using a smooth proximity function to smooth the max-structure objective function $f$ in (2), and develop a new smoothing algorithm, Algorithm 1, based on the accelerated proximal-gradient method to adaptively update the smoothness parameter in a heuristic-free fashion.

(b) We prove up to the $O(1/\varepsilon)$ worst-case iteration-complexity for our algorithm as in [21] to achieve an $\varepsilon$-solution, i.e., $F(x^k) - F^* \le \varepsilon$. Especially, with the quadratic proximity function $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$, our algorithm achieves exactly the $O(1/\varepsilon)$ worst-case iteration-complexity as in [Mj.

(c) We customize our algorithm to handle four important special cases that have a great practical impact in many applications.

(d) We specify our algorithm to solve constrained convex minimization problems, and propose an averaging scheme to recover an approximate primal solution with a rigorous convergence guarantee.

From a practical point of view, we believe that the proposed algorithm can overcome the three disadvantages of the original smoothing algorithm in [24] mentioned previously. However, our condition $L_b = 1$ on the choice of proximity functions may limit the ability of the proposed algorithm to further exploit the structure of the constraint set $\mathcal{U}$. Fortunately, we can identify several important settings in Section 4 where we can eliminate this disadvantage. Such classes of problems cover several applications in image processing,
compressive sensing, and monotropic programming [51[IiJ[21[31]. 

Paper organization: The rest of this paper is organized as follows. Section 2 briefly discusses our smoothing technique. Section 3 presents our main algorithm, Algorithm 1, and proves its convergence guarantee. Section 4 handles four special but important cases of (1). Section 5 specializes our algorithm to solve constrained convex minimization problems. Preliminary numerical examples are given in Section 6. For clarity of presentation, we move the long and technical proofs to the appendix.

Notation and terminology: We work on the real spaces $\mathbb{R}^p$ and $\mathbb{R}^n$, equipped with the standard inner product $\langle \cdot, \cdot \rangle$ and the Euclidean $\ell_2$-norm $\|\cdot\|$. Given a proper, closed, and convex function $g$, we use $\mathrm{dom}(g)$ and $\partial g(x)$ to denote its domain and its subdifferential at $x$, respectively. If $g$ is differentiable, then $\nabla g(x)$ stands for its gradient at $x$.

We denote by $f^*(s) := \sup\{\langle s, x \rangle - f(x) : x \in \mathrm{dom}(f)\}$ the Fenchel conjugate of $f$. For a given set $\mathcal{X}$, $\delta_{\mathcal{X}}(x) := 0$ if $x \in \mathcal{X}$ and $\delta_{\mathcal{X}}(x) := +\infty$, otherwise, defines the indicator function of $\mathcal{X}$. For a smooth function $f$, we say that $f$ is $L_f$-smooth if for any $x, \tilde{x} \in \mathrm{dom}(f)$, we have $\|\nabla f(x) - \nabla f(\tilde{x})\| \le L_f \|x - \tilde{x}\|$, where $L(f) := L_f \in [0, \infty)$. We denote by $\mathcal{F}_L^{1,1}$ the class of all $L_f$-smooth and convex functions $f$. We also use $\mu_f \equiv \mu(f)$ for the strong convexity parameter of a convex function $f$. For a given symmetric matrix $X$, $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ denote its smallest and largest eigenvalues, respectively; and $\mathrm{cond}(X)$ is the condition number of $X$. Given a nonempty, closed and convex set $\mathcal{X}$, $\mathrm{dist}(x, \mathcal{X})$ denotes the distance from $x$ to $\mathcal{X}$.

2 Smoothing techniques via smooth proximity functions 

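Let $b_{\mathcal{U}}$ be a prox-function of the nonempty, closed and convex set $\mathcal{U}$ with the strong convexity parameter $\mu_b = 1$. In addition, $b_{\mathcal{U}}$ is smooth on $\mathcal{U}$, and its gradient $\nabla b_{\mathcal{U}}$ is Lipschitz continuous with the Lipschitz constant $L_b \ge \mu_b = 1$. In this case, $b_{\mathcal{U}}$ is said to be $L_b$-smooth. As a default example, $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$ for fixed $\bar{u}^c \in \mathcal{U}$ satisfies our assumptions with $L_b = \mu_b = 1$. Let $\bar{u}^c$ be the $b$-prox-center point of $\mathcal{U}$, i.e., $\bar{u}^c := \mathrm{argmin}_u \{ b_{\mathcal{U}}(u) : u \in \mathcal{U} \}$. Without loss of generality, we can assume that $b_{\mathcal{U}}(\bar{u}^c) = 0$. Otherwise, we consider $\hat{b}_{\mathcal{U}}(u) := b_{\mathcal{U}}(u) - b_{\mathcal{U}}(\bar{u}^c)$.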

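Given a convex function $\varphi^*(z) := \max_u \{ \langle z, u \rangle - \varphi(u) : u \in \mathcal{U} \}$, we define a smoothed approximation of $\varphi^*$ as

$$\varphi^*_\gamma(z) := \max_{u \in \mathcal{U}} \{ \langle z, u \rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) \}, \tag{6}$$

where $\gamma > 0$ is a smoothness parameter. We note that $\varphi^*$ is not a Fenchel conjugate of $\varphi$ unless $\mathcal{U} = \mathrm{dom}(\varphi)$. We denote by $u^*_\gamma(z)$ the unique optimal solution of the strongly concave maximization problem (6), i.e.:

$$u^*_\gamma(z) := \mathrm{argmax}_{u} \{ \langle z, u \rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) : u \in \mathcal{U} \}. \tag{7}$$

We also define $D_{\mathcal{U}} := \sup_u \{ b_{\mathcal{U}}(u) : u \in \mathcal{U} \cap \mathrm{dom}(\varphi) \}$, the $b$-prox diameter of $\mathcal{U}$. If $\mathcal{U}$ or $\mathrm{dom}(\varphi)$ is bounded, then $D_{\mathcal{U}} \in [0, +\infty)$.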

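Associated with $\varphi^*_\gamma$, we consider a smoothed function for $f$ in (2) as

$$f_\gamma(x) := \varphi^*_\gamma(A^\top x) = \max_{u} \{ \langle A^\top x, u \rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) : u \in \mathcal{U} \}. \tag{8}$$

Then, the following lemma summarizes the properties of the smoothed function $\varphi^*_\gamma$ defined by (6) and $f_\gamma$ defined by (8), whose proof can be found in [ST].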

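Lemma 1 The function $\varphi^*_\gamma$ defined by (6) is convex and smooth. Its gradient is given by $\nabla \varphi^*_\gamma(z) = u^*_\gamma(z)$, which is Lipschitz continuous with the Lipschitz constant $L_{\varphi^*_\gamma} := \gamma^{-1}$. Consequently, for any $z, \hat{z} \in \mathbb{R}^n$, we have

$$\frac{\gamma}{2}\|u^*_\gamma(z) - u^*_\gamma(\hat{z})\|^2 \le \varphi^*_\gamma(z) - \varphi^*_\gamma(\hat{z}) - \langle \nabla \varphi^*_\gamma(\hat{z}), z - \hat{z} \rangle \le \frac{1}{2\gamma}\|z - \hat{z}\|^2. \tag{9}$$

For fixed $z \in \mathbb{R}^n$, $\varphi^*_\gamma(z)$ is convex w.r.t. $\gamma \in \mathbb{R}_{++}$, and

$$\varphi^*_\gamma(z) - (\hat{\gamma} - \gamma)\, b_{\mathcal{U}}(u^*_\gamma(z)) \le \varphi^*_{\hat{\gamma}}(z), \quad \forall \gamma, \hat{\gamma} \in \mathbb{R}_{++}. \tag{10}$$

As a consequence, $f_\gamma$ defined by (8) is convex and smooth. Its gradient is given by $\nabla f_\gamma(x) = A u^*_\gamma(A^\top x)$, which is Lipschitz continuous with the Lipschitz constant $L_{f_\gamma} := \gamma^{-1}\|A\|^2$. In addition, we also have

$$f_\gamma(x) \le f(x) \le f_\gamma(x) + \gamma D_{\mathcal{U}}, \quad \forall x \in \mathbb{R}^p. \tag{11}$$

We emphasize that Lemma 1 provides key properties to analyze the complexity of our algorithm in the next sections.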
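
To make the sandwich bound (11) tangible, the sketch below smooths $f(x) = \|x\|_1 = \max\{\langle x, u\rangle : \|u\|_\infty \le 1\}$ with the quadratic prox-function $b_{\mathcal{U}}(u) = (1/2)\|u\|^2$. This specific choice ($A$ equal to the identity, $\varphi = 0$, $\mathcal{U}$ the $\ell_\infty$ unit ball, so that $D_{\mathcal{U}} = n/2$) and all function names are assumptions made for illustration only.

```python
def u_star(x, gamma):
    """Maximizer of <x, u> - (gamma/2)||u||^2 over the box ||u||_inf <= 1."""
    return [min(1.0, max(-1.0, xi / gamma)) for xi in x]


def f_smooth(x, gamma):
    """Smoothed l1-norm f_gamma(x), evaluated at the maximizer u_star(x, gamma)."""
    u = u_star(x, gamma)
    return sum(xi * ui - 0.5 * gamma * ui * ui for xi, ui in zip(x, u))


def f_exact(x):
    """The nonsmooth function f(x) = ||x||_1."""
    return sum(abs(xi) for xi in x)


x = [2.0, -0.3, 0.0, 0.7]
gamma = 0.5
D_U = len(x) / 2.0                      # sup of (1/2)||u||^2 over the unit box
lower = f_smooth(x, gamma)              # f_gamma(x)
upper = lower + gamma * D_U             # f_gamma(x) + gamma * D_U
```

In this instance `u_star` is also the gradient of the smoothed function, and `lower <= f_exact(x) <= upper` is exactly the statement of (11).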


3 The adaptive smoothing algorithm and its convergence 

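Associated with (1), we consider its smoothed composite convex problem as

$$F^*_\gamma := \min_{x \in \mathbb{R}^p} \{ F_\gamma(x) := f_\gamma(x) + g(x) \}. \tag{12}$$

Similar to [24], the main step of Nesterov’s accelerated proximal-gradient scheme applied to the smoothed problem (12) is expressed as follows:

$$x^{k+1} := \mathrm{prox}_{\beta g}\big( \hat{x}^k - \beta \nabla f_\gamma(\hat{x}^k) \big), \tag{13}$$

where $\hat{x}^k$ is given, and $\beta > 0$ is a given step size, which will be chosen later. The following lemma provides a descent property of the proximal-gradient step (13), whose proof can be found in Appendix A.1.

Lemma 2 Let $x^{k+1}$ be generated by (13). Then, for any $x \in \mathbb{R}^p$, we have

$$F_\gamma(x^{k+1}) \le \hat{\ell}^k_\gamma(x) + \frac{1}{\beta}\langle x^{k+1} - \hat{x}^k, x - \hat{x}^k \rangle - \Big( \frac{1}{\beta} - \frac{\|A\|^2}{2\gamma} \Big)\|\hat{x}^k - x^{k+1}\|^2, \tag{14}$$

where

$$\hat{\ell}^k_\gamma(x) := f_\gamma(\hat{x}^k) + \langle \nabla f_\gamma(\hat{x}^k), x - \hat{x}^k \rangle + g(x) \le F_\gamma(x) - \frac{\gamma}{2}\|u^*_\gamma(A^\top x) - u^*_\gamma(A^\top \hat{x}^k)\|^2. \tag{15}$$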

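We now adopt the accelerated proximal-gradient scheme (FISTA) in ji] to solve (12) using an adaptive step-size $\beta_{k+1} := \gamma_{k+1}/\|A\|^2$, which becomes

$$\begin{cases} \hat{x}^k := (1 - \tau_k) x^k + \tau_k \tilde{x}^k \\ x^{k+1} := \mathrm{prox}_{\beta_{k+1} g}\big( \hat{x}^k - \beta_{k+1} \nabla f_{\gamma_{k+1}}(\hat{x}^k) \big) \\ \tilde{x}^{k+1} := \tilde{x}^k - \frac{1}{\tau_k}(\hat{x}^k - x^{k+1}), \end{cases} \tag{16}$$

where $\gamma_{k+1} > 0$ is the smoothness parameter, and $\tau_k \in (0, 1]$.

By letting $t_k := 1/\tau_k$, we can eliminate $\tilde{x}^k$ in (16) to obtain a compact version

$$\begin{cases} x^{k+1} := \mathrm{prox}_{\beta_{k+1} g}\big( \hat{x}^k - \beta_{k+1} \nabla f_{\gamma_{k+1}}(\hat{x}^k) \big) \\ \hat{x}^{k+1} := x^{k+1} + \frac{t_k - 1}{t_{k+1}}(x^{k+1} - x^k). \end{cases} \tag{17}$$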

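The following lemma provides a key estimate to prove the convergence of the scheme (16) (or (17)), whose proof can be found in Appendix A.2.

Lemma 3 Let $\{(x^k, \tilde{x}^k, \tau_k, \gamma_k)\}$ be the sequence generated by (16). Then

$$F_{\gamma_{k+1}}(x^{k+1}) \le (1 - \tau_k) F_{\gamma_k}(x^k) + \tau_k F(x) + \frac{\|A\|^2 \tau_k^2}{2\gamma_{k+1}} \big[ \|\tilde{x}^k - x\|^2 - \|\tilde{x}^{k+1} - x\|^2 \big] - R_k, \tag{18}$$

for any $x \in \mathbb{R}^p$, where $R_k$ is defined by

$$R_k := \tau_k \gamma_{k+1} b_{\mathcal{U}}\big(u^*_{\gamma_{k+1}}(A^\top \hat{x}^k)\big) - (1 - \tau_k)(\gamma_k - \gamma_{k+1})\, b_{\mathcal{U}}\big(u^*_{\gamma_{k+1}}(A^\top x^k)\big) + \frac{(1 - \tau_k)\gamma_{k+1}}{2} \big\| u^*_{\gamma_{k+1}}(A^\top \hat{x}^k) - u^*_{\gamma_{k+1}}(A^\top x^k) \big\|^2. \tag{19}$$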


Moreover, the quantity Rk is bounded from below by 

Rk>\{l- Tk) [ik+iXk - Lb{-fk - 7fe+i)] hu[uf^^^{A^x’^)). 


(19) 


( 20 ) 


Next, we show one possibility for updating $\tau_k$ and $\gamma_k$, and provide an upper bound for $F_{\gamma_k}(x^k) - F^*$. The proof of this lemma is moved to Appendix A.3.

Lemma 4 Let us choose $\hat{x}^0 := x^0 \in \mathrm{dom}(F)$, $\gamma_1 > 0$, and an arbitrary constant $c \ge 1$. If the parameters $\tau_k$ and $\gamma_k$ are updated by

$$\tau_k := \frac{1}{k + c} \quad \text{and} \quad \gamma_{k+1} := \frac{\gamma_1 c}{k + c}, \tag{21}$$


then the quantity Rk defined by (19) and {{Tk,^k)} satisfy 


Tf (fc + c )2 

Moreover, the following estimate holds 

k-H 


(1 - Tk)'yk+i 


Ik 


'k-l 




where 

Sk-.= ilt?Y.U 


rl 

■(l-To)7i 

Ik+i 


'{D- 



[F,fix^)-F*)+''^\\x^-x*r + SkDu 


( 21 ) 


( 22 ) 


,(23) 


< 7 rc 7 ^&-l) (ln(A:+c) + l)+ 7 ?(c+l). (24) 
In particular, if we choose by such that Lf, = 1, then Sk < 7 i(c+ 1). 

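With $\tau_k = 1/(k+c)$ from (21), the second line of (17) reduces to $\hat{x}^{k+1} := x^{k+1} + \frac{k+c-1}{k+c+1}(x^{k+1} - x^k)$. Using this step in (17) and combining the result with the update rule (21), we can present our algorithm for solving (1) as in Algorithm 1.

Algorithm 1 (Adaptive Smoothing Proximal-Gradient Algorithm)

Initialization:

1: Choose $\gamma_1 > 0$, $c \ge 1$ and $x^0 \in \mathrm{dom}(F)$. Set $\hat{x}^0 := x^0$.

Iteration: For $k = 0$ to $k_{\max}$, perform:

2: Solve the following strongly concave maximization subproblem

$$u^*_{\gamma_{k+1}}(A^\top \hat{x}^k) := \mathrm{argmax}_{u \in \mathcal{U}} \big\{ \langle \hat{x}^k, Au \rangle - \varphi(u) - \gamma_{k+1} b_{\mathcal{U}}(u) \big\}.$$

3: Perform the following proximal-gradient step with $\beta_{k+1} := \gamma_{k+1}/\|A\|^2$:

$$x^{k+1} := \mathrm{prox}_{\beta_{k+1} g}\big( \hat{x}^k - \beta_{k+1} A u^*_{\gamma_{k+1}}(A^\top \hat{x}^k) \big).$$

4: Update $\hat{x}^{k+1} := x^{k+1} + \frac{k+c-1}{k+c+1}(x^{k+1} - x^k)$.

5: Compute $\gamma_{k+2} := \frac{\gamma_1 c}{k+c+1}$.

End for

The following theorem proves the convergence of Algorithm 1 and estimates its worst-case iteration-complexity.

Theorem 1 Let $\{x^k\}$ be the sequence generated by Algorithm 1 using $c = 1$. Then, for $k \ge 1$, we have

$$F(x^k) - F^* \le \frac{\|A\|^2}{2\gamma_1 k}\|x^0 - x^*\|^2 + \frac{3\gamma_1 D_{\mathcal{U}}}{k} + \frac{\gamma_1 (L_b - 1)(\ln(k) + 1) D_{\mathcal{U}}}{k}. \tag{25}$$

If $b_{\mathcal{U}}$ is chosen so that $L_b = 1$ (e.g., $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$), then (25) reduces to

$$F(x^k) - F^* \le \frac{\|A\|^2 R_0^2}{2\gamma_1 k} + \frac{3\gamma_1 D_{\mathcal{U}}}{k} \quad (\forall k \ge 1). \tag{26}$$

Consequently, if we set $\gamma_1 := \frac{R_0\|A\|}{\sqrt{6 D_{\mathcal{U}}}}$, which is independent of $k$, then

$$F(x^k) - F^* \le \frac{R_0 \|A\| \sqrt{6 D_{\mathcal{U}}}}{k} \quad (\forall k \ge 1), \tag{27}$$

where $R_0 := \|x^0 - x^*\|$.

In this case, the worst-case iteration-complexity of Algorithm 1 to achieve an $\varepsilon$-solution $x^k$ to (1) such that $F(x^k) - F^* \le \varepsilon$ is $k_{\max} := O\big( \|A\| R_0 \sqrt{D_{\mathcal{U}}}/\varepsilon \big)$.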
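
The five steps of Algorithm 1 can be sketched in a few lines for one concrete instance. The code below is a minimal illustration, not the author's implementation: it assumes $f(x) = \|Bx - b\|_1$ (so that $A = B^\top$, $\varphi(u) = \langle b, u\rangle$, $\mathcal{U}$ is the $\ell_\infty$ unit ball and $b_{\mathcal{U}} = (1/2)\|\cdot\|^2$) and $g = \lambda\|\cdot\|_1$. It also replaces $\|A\|^2$ by the squared Frobenius norm, an upper bound that avoids an eigenvalue computation. All function names are hypothetical.

```python
def matvec(B, x):
    """Compute B x for a matrix stored as a list of rows."""
    return [sum(bij * xj for bij, xj in zip(row, x)) for row in B]


def rmatvec(B, u):
    """Compute B^T u."""
    return [sum(B[i][j] * u[i] for i in range(len(B))) for j in range(len(B[0]))]


def soft(x, t):
    """Soft-thresholding: proximal operator of t * ||.||_1."""
    return [max(abs(xi) - t, 0.0) * (1.0 if xi > 0 else -1.0) for xi in x]


def objective(B, b, lam, x):
    """F(x) = ||B x - b||_1 + lam * ||x||_1."""
    r = matvec(B, x)
    return sum(abs(ri - bi) for ri, bi in zip(r, b)) + lam * sum(abs(xi) for xi in x)


def adaptive_smoothing(B, b, lam, gamma1, num_iters, c=1.0):
    """Sketch of Algorithm 1 for min ||B x - b||_1 + lam * ||x||_1."""
    # Assumption: squared Frobenius norm used as an upper bound on ||B||^2.
    norm_sq = sum(bij * bij for row in B for bij in row)
    x = [0.0] * len(B[0])
    x_hat = x[:]
    gamma = gamma1                                   # gamma_{k+1} at k = 0
    for k in range(num_iters):
        # Step 2: u* is the projection of (B x_hat - b) / gamma onto the unit box.
        r = matvec(B, x_hat)
        u = [min(1.0, max(-1.0, (ri - bi) / gamma)) for ri, bi in zip(r, b)]
        # Step 3: proximal-gradient step with beta = gamma / ||A||^2.
        beta = gamma / norm_sq
        grad = rmatvec(B, u)
        x_new = soft([xh - beta * gi for xh, gi in zip(x_hat, grad)], beta * lam)
        # Step 4: momentum step with weight (k + c - 1) / (k + c + 1).
        w = (k + c - 1.0) / (k + c + 1.0)
        x_hat = [xn + w * (xn - xo) for xn, xo in zip(x_new, x)]
        x = x_new
        # Step 5: decrease the smoothness parameter, gamma_{k+2} = gamma1 * c / (k + c + 1).
        gamma = gamma1 * c / (k + c + 1.0)
    return x


# Tiny example whose minimizer is x = (1, 2) with optimal value 0.3.
B_small = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_small = [1.0, 2.0, 3.0]
x_approx = adaptive_smoothing(B_small, b_small, lam=0.1, gamma1=1.0, num_iters=2000)
F_approx = objective(B_small, b_small, 0.1, x_approx)
```

Note that no smoothness parameter has to be fixed in advance from a target accuracy: only $\gamma_1$ is supplied, and each iteration costs one application of $B$, one of $B^\top$, and one soft-thresholding.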

Proof From ([2^, c = 1 we have ^ = 7 ^- Using this bound 

and Sk-i < 7 i (Ufa — 1) [In(fc) -I- 1] -I- 27 ^ into ([^ we get 






|x°-x*|P 


7i(l - To) 


[F,a(x°)-F*] 


-f 


(71 (Lb - 1 ) [In(fc) + 1] + 271 ) Du 
k 


Since $F(x^k) - F_{\gamma_k}(x^k) \le \gamma_k D_{\mathcal{U}}$ due to (11), and $\gamma_k = \gamma_1/k$, substituting this inequality into the last estimate, and using $\tau_0 = 1/c = 1$, we obtain (25).

If we choose $b_{\mathcal{U}}$ such that $L_b = 1$, e.g., $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$, then $S_k \le 2\gamma_1^2$ as shown in (24). Using this, it follows from (25) that $F(x^k) - F^* \le \frac{\|A\|^2 R_0^2}{2\gamma_1 k} + \frac{3\gamma_1}{k} D_{\mathcal{U}}$. By minimizing the right-hand side of this estimate w.r.t. $\gamma_1 > 0$, we have $\gamma_1 := \frac{R_0\|A\|}{\sqrt{6 D_{\mathcal{U}}}}$ and hence, $F(x^k) - F^* \le \frac{R_0\|A\|\sqrt{6 D_{\mathcal{U}}}}{k}$, which is exactly (27). The last statement is a direct consequence of (27). □

For a general prox-function $b_{\mathcal{U}}$ with $L_b > 1$, Theorem 1 shows that the convergence rate of Algorithm 1 is $O(\ln(k)/k)$, which is similar to m. However, when $L_b$ is close to 1, the last term in (25) is better than El Theorem 1].

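Remark 1 Let $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$. Then, (27) shows that the maximum number of iterations in Algorithm 1 is $k_{\max} := \lfloor R_0 \|A\| \sqrt{6 D_{\mathcal{U}}}/\varepsilon \rfloor$, which is of the same order as $k_{\max} := \lfloor 2\sqrt{2}\,\|A\| R_0 \sqrt{D_{\mathcal{U}}}/\varepsilon \rfloor$ in [24] (with different factors, $\sqrt{6}$ and $2\sqrt{2}$).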


4 Exploiting structures for special cases 

For a general smooth proximity function $b_{\mathcal{U}}$ with $L_b > 1$, we can achieve the $O(\ln(k)/k)$ convergence rate. When $L_b = 1$, we obtain exactly the $O(1/k)$ rate as in [53]. In this section, we consider three special cases of (1) where we use the quadratic proximity function $b_{\mathcal{U}}(\cdot) := (1/2)\|\cdot - \bar{u}^c\|^2$. Then, we specify Algorithm 1 for the $L_g$-smooth objective function $g$ in (1).

4.1 Fenchel conjugate 

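Let $f^*$ be the Fenchel conjugate of $f$. We can write $f$ in the form of (2) as

$$f(x) = \max_{u} \{ \langle x, u \rangle - f^*(u) : u \in \mathrm{dom}(f^*) \}.$$

We can smooth $f$ by using $b_{\mathcal{U}}(u) := (1/2)\|u\|^2$ as

$$f_\gamma(x) := \max_{u \in \mathrm{dom}(f^*)} \{ \langle x, u \rangle - f^*(u) - (\gamma/2)\|u\|^2 \} = \frac{1}{2\gamma}\|x\|^2 - M_{\gamma^{-1} f^*}(\gamma^{-1} x),$$

where $M_{\beta h}(z) := \min_u \{ h(u) + (1/(2\beta))\|u - z\|^2 \}$ is the Moreau envelope of a convex function $h$ with a parameter $\beta > 0$. In this case, $u^*_\gamma(x) = \mathrm{prox}_{\gamma^{-1} f^*}(\gamma^{-1} x) = \gamma^{-1}(x - \mathrm{prox}_{\gamma f}(x))$. Hence, $\nabla f_\gamma(x) = \gamma^{-1}(x - \mathrm{prox}_{\gamma f}(x))$. The main step, Step 3, of Algorithm 1 becomes

$$x^{k+1} = \mathrm{prox}_{\gamma_{k+1} g}\big( \mathrm{prox}_{\gamma_{k+1} f}(\hat{x}^k) \big).$$

Hence, Algorithm 1 can be applied to solve (1) using the proximal operators of $f$ and $g$. The worst-case complexity bound in Theorem 1 becomes $O\big( R_0 D_{\mathrm{dom}(f^*)}/\varepsilon \big)$, where $D_{\mathrm{dom}(f^*)} := \max_{u \in \mathrm{dom}(f^*)} \|u\|$ is the diameter of $\mathrm{dom}(f^*)$.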
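
The identity $\mathrm{prox}_{\gamma^{-1} f^*}(\gamma^{-1}x) = \gamma^{-1}(x - \mathrm{prox}_{\gamma f}(x))$ can be checked numerically on a simple case. The sketch below assumes $f = \|\cdot\|_1$, whose conjugate is the indicator of the $\ell_\infty$ unit ball, so the left-hand side is a projection onto that ball and the right-hand side uses soft-thresholding. The function names are illustrative only.

```python
def soft_threshold(x, t):
    """prox of t * ||.||_1, i.e. prox_{t f} for f = ||.||_1."""
    return [max(abs(xi) - t, 0.0) * (1.0 if xi > 0 else -1.0) for xi in x]


def project_box(z):
    """Projection onto {u : ||u||_inf <= 1}; this is prox of the indicator f*."""
    return [min(1.0, max(-1.0, zi)) for zi in z]


gamma = 0.4
x = [1.5, -0.1, 0.4, -2.0]

# Left-hand side: prox of f* evaluated at x / gamma.
u_via_conjugate = project_box([xi / gamma for xi in x])

# Right-hand side: (x - prox_{gamma f}(x)) / gamma.
u_via_prox_f = [(xi - pi) / gamma for xi, pi in zip(x, soft_threshold(x, gamma))]
```

Both lists agree up to rounding, which is why only the proximal operator of $f$ is needed to run the algorithm in this setting.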

4.2 Composite convex minimization with linear operator 

We consider the following composite convex problem with a linear operator 

that covers many important applications in practice, see, e.g., [Il[3lll5j: 

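$$F^* := \min_{x \in \mathbb{R}^p} \{ F(x) := f(Ax) + g(x) \}, \tag{28}$$

where $f$ and $g$ are two proper, closed and convex functions, and $A$ is a linear operator from $\mathbb{R}^p$ to $\mathbb{R}^n$.

We first write $f(Ax) := \max_u \{ \langle Ax, u \rangle - f^*(u) : u \in \mathrm{dom}(f^*) \}$. Next, we choose a quadratic smoothing proximity function $b_{\mathcal{U}}(u) := (1/2)\|u - \bar{u}^c\|^2$ for fixed $\bar{u}^c \in \mathrm{dom}(f^*)$, and define $\mathcal{U} := \mathrm{dom}(f^*)$. Using this smoothing prox-function, we obtain a smoothed approximation of $f(Ax)$ as follows:

$$f_\gamma(Ax) := \max_{u} \{ \langle Ax, u \rangle - f^*(u) - (\gamma/2)\|u - \bar{u}^c\|^2 : u \in \mathrm{dom}(f^*) \}.$$

In this case, we can compute $u^*_\gamma(Ax) = \mathrm{prox}_{\gamma^{-1} f^*}\big( \bar{u}^c + \gamma^{-1} Ax \big)$ by using the proximal operator of $f^*$. By Fenchel-Moreau’s decomposition $\mathrm{prox}_{\gamma^{-1} f^*}(\gamma^{-1} v) = \gamma^{-1}(v - \mathrm{prox}_{\gamma f}(v))$ as above, we can compute $\mathrm{prox}_{\gamma^{-1} f^*}$ using the proximal operator of $f$. In this case, we can specify the proximal-gradient step (13) as

$$\begin{cases} \hat{u}^*_{k+1} := \bar{u}^c + \gamma_{k+1}^{-1}\big[ A\hat{x}^k - \mathrm{prox}_{\gamma_{k+1} f}(\gamma_{k+1}\bar{u}^c + A\hat{x}^k) \big] \\ x^{k+1} := \mathrm{prox}_{\beta_{k+1} g}\big( \hat{x}^k - \beta_{k+1} A^\top \hat{u}^*_{k+1} \big), \end{cases}$$

where $\beta_{k+1} := \gamma_{k+1}\|A\|^{-2}$. Using this proximal-gradient step in Algorithm 1, we still obtain the complexity as in Theorem 1, which is $O\big( \|A\| R_0 \sqrt{D_{\mathcal{U}}}/\varepsilon \big)$, where the domain $\mathcal{U} := \mathrm{dom}(f^*)$ of $f^*$ is assumed to be bounded.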


4.3 The decomposable structure 

The function $\varphi$ and the set $\mathcal{U}$ in (2) are said to be decomposable if they can be represented as follows:

$$\varphi(u) := \sum_{i=1}^m \varphi_i(u_i), \quad \text{and} \quad \mathcal{U} := \mathcal{U}_1 \times \cdots \times \mathcal{U}_m, \tag{29}$$

where $m \ge 2$, $u_i \in \mathbb{R}^{n_i}$, $\mathcal{U}_i \subseteq \mathbb{R}^{n_i}$ and $\sum_{i=1}^m n_i = n$. In this case, we also say that problem (1) is decomposable.

The structure (29) naturally arises in linear programming and monotropic programming. In addition, many nondecomposable problems such as consensus optimization, empirical loss optimization, conic programming and geometric programming can also be reformulated into (1) with the structure (29). The decomposable structure (29) immediately supports parallel and distributed computation. Exploiting this structure, one can design new parallel and distributed optimization algorithms using the same approach as in Algorithm 1 for solving (1), see, e.g., [10111311291150].

Under the structure (29), we choose a decomposable smoothing function $b_{\mathcal{U}}(u) := \sum_{i=1}^m b_{\mathcal{U}_i}(u_i)$, where $b_{\mathcal{U}_i}$ is the prox-function of $\mathcal{U}_i$ for $i = 1, \cdots, m$. The smoothed function $f_\gamma$ for $f$ is decomposable, and is represented as follows:

$$f_\gamma(x) := \sum_{i=1}^m f^i_\gamma(x) := \sum_{i=1}^m \max_{u_i \in \mathcal{U}_i} \{ \langle x, A_i u_i \rangle - \varphi_i(u_i) - \gamma b_{\mathcal{U}_i}(u_i) \}. \tag{30}$$

Let us denote by $u^*_{\gamma,i}(A_i^\top x)$ the unique solution of the subproblem $i$ in (30) for $i = 1, \cdots, m$. Then, under the decomposable structure, the evaluation of $f_\gamma$ and $u^*_\gamma(A^\top x) := [u^*_{\gamma,1}(A_1^\top x), \cdots, u^*_{\gamma,m}(A_m^\top x)]$ can be computed in parallel.

If we apply Algorithm 1 to solve (1) with the structure (29), then we have the following guarantee on the objective residual:


F{x'^) - F* < 


LaWx'Fi;* 

271 fe 


_j_ (|3 _|_ _ 2) (ln(fc + 1) + 1)). 


where La := := niax{L{,, : 1 < i < m} and Du := 

Hence, the convergence rate of Algorithm stated in Theorem jl is O ■ 


If we choose := (I/2)|| • —for all * = I,-- - ,to, t 


len Lf, = 1- 


Consequently, we obtain the O i LaE^ULm. j -worst-case iteration-complexity. 


4.4 The Lipschitz gradient structure 

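If $g$ is smooth and its gradient $\nabla g$ is Lipschitz continuous with the Lipschitz constant $L_g > 0$, then $F_\gamma := f_\gamma + g \in \mathcal{F}_L^{1,1}$, i.e., $\nabla F_\gamma$ is Lipschitz continuous with the Lipschitz constant $L_{F_\gamma} := L_g + \gamma^{-1}\|A\|^2$.

We replace the proximal-gradient step (13) used in Algorithm 1 by the following “full” gradient step

$$x^{k+1} := \hat{x}^k - \beta_{k+1}\big( \nabla g(\hat{x}^k) + A u^*_{\gamma_{k+1}}(A^\top \hat{x}^k) \big), \tag{31}$$

where $u^*_{\gamma_{k+1}}(A^\top \hat{x}^k)$ is computed by (7) and $\beta_{k+1} := \frac{\gamma_{k+1}}{L_g \gamma_{k+1} + \|A\|^2}$ is a given step-size. Unlike (21), we update the parameters $\tau_k$ and $\gamma_k$ as

$$\tau_k := \frac{1}{k + 1} \quad \text{and} \quad \gamma_{k+1} := \frac{k \gamma_k \|A\|^2}{L_g \gamma_k + \|A\|^2 (k + 1)},$$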


where 71 := is hxed. We name this variant as Algorithm |^b) . 

The following corollary summarizes the convergence properties of this variant, whose proof can be found in Appendix A.4.

Corollary 1 Assume that $g \in \mathcal{F}_L^{1,1}$ with the Lipschitz constant $L_g \ge 0$. Let $\{x^k\}$ be the sequence generated by Algorithm 1(b). Then, for $k \ge 1$, one has

+ ^ (^ + 1 ) (ln(fc) + l)J^.(32) 


If we choose bu such that Lb = 1, then (32) reduces to 

F(x^) -F*< + 2)Du. 

Consequently, the worst-case iteration-complexity of Algorithm 1(b) is

5 Application to general constrained convex optimization

In this section, we customize Algorithm 1 to solve the following general constrained convex optimization problem:

$$\varphi^* := \min_{u \in \mathbb{R}^n} \{ \varphi(u) : Au - b \in -\mathcal{K},\ u \in \mathcal{U} \}, \tag{33}$$

where $\varphi$ is a proper, closed and convex function from $\mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, $A \in \mathbb{R}^{p \times n}$, $b \in \mathbb{R}^p$, and $\mathcal{U}$ and $\mathcal{K}$ are two nonempty, closed and convex sets in $\mathbb{R}^n$ and $\mathbb{R}^p$, respectively. Without loss of generality, we can assume that $\varphi$ and $\mathcal{U}$ are decomposable as in (29) with $m \ge 1$.

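Associated with the primal setting (33), we consider its dual problem

$$F^* := \min_{x \in \mathbb{R}^p} \Big\{ F(x) := \max_{u \in \mathcal{U}} \{ \langle x, Au \rangle - \varphi(u) \} - \langle b, x \rangle + \max_{r \in \mathcal{K}} \langle x, r \rangle \Big\}. \tag{34}$$

Clearly, (34) has the same form as (1) with $f(x) := \max_u \{ \langle x, Au \rangle - \varphi(u) : u \in \mathcal{U} \}$ and $g(x) := s_{\mathcal{K}}(x) - \langle b, x \rangle$, where $s_{\mathcal{K}}$ is the support function of $\mathcal{K}$.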

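We now specify Algorithm 1 to solve this dual problem. Computing $u^*_\gamma(A^\top x)$ requires solving the following sub-problem:

$$u^*_\gamma(A^\top x) := \mathrm{argmax}_{u} \{ \langle x, Au \rangle - \varphi(u) - \gamma b_{\mathcal{U}}(u) : u \in \mathcal{U} \}.$$

The proximal step of $g$ becomes $\mathrm{prox}_g(x) := \mathrm{prox}_{s_{\mathcal{K}}}(x + b) = (x + b) - \mathrm{proj}_{\mathcal{K}}(x + b)$, where $\mathrm{proj}_{\mathcal{K}}(\cdot)$ is the projection onto $\mathcal{K}$. Together with the dual steps, we use an adaptive weighted averaging scheme

$$\bar{u}^k := \Gamma_k^{-1} \sum_{i=0}^k w_i\, u^*_{\gamma_{i+1}}(A^\top \hat{x}^i), \quad \text{where } w_i := \frac{\gamma_{i+1}}{\tau_i} \text{ and } \Gamma_k := \sum_{i=0}^k w_i, \tag{35}$$

to construct an approximate primal solution $\bar{u}^k$ to an optimal solution $u^*$ of (33). Clearly, we can compute $\bar{u}^k$ recursively starting from $\bar{u}^{-1} := 0^n$ as

$$\bar{u}^k := (1 - \nu_k)\bar{u}^{k-1} + \nu_k u^*_{\gamma_{k+1}}(A^\top \hat{x}^k), \quad \text{where } \nu_k := (\Gamma_k \tau_k)^{-1}\gamma_{k+1} \in [0, 1]. \tag{36}$$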

We incorporate this scheme into Algorithm 1 to solve (34). While Algorithm 1 constructs an approximate solution $x^k$ to the dual problem (34), (35) allows us to recover an approximate solution $\bar{u}^k$ of the primal problem (33). We name this algorithmic variant as Algorithm 1(c).
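
The averaging step can be maintained with constant memory. The sketch below assumes the weights have the form $w_i = \gamma_{i+1}/\tau_i$, so that the mixing coefficient is $\nu_k = w_k/\Gamma_k$ with $\Gamma_k$ the running sum of weights; this form is inferred from the expression for $\nu_k$ and should be read as an assumption. The function name and the toy input are hypothetical.

```python
def weighted_average_recursive(points, gammas, taus):
    """Running weighted average of dual maximizers.

    points[k] plays the role of u*_{gamma_{k+1}}(A^T x_hat^k),
    gammas[k] the role of gamma_{k+1}, and taus[k] the role of tau_k.
    Assumed weights: w_k = gamma_{k+1} / tau_k.
    """
    u_bar = [0.0] * len(points[0])
    weight_sum = 0.0                     # Gamma_k
    history = []
    for u, gamma_next, tau in zip(points, gammas, taus):
        w = gamma_next / tau
        weight_sum += w
        nu = w / weight_sum              # nu_k = gamma_{k+1} / (Gamma_k * tau_k)
        u_bar = [(1.0 - nu) * ub + nu * ui for ub, ui in zip(u_bar, u)]
        history.append(u_bar)
    return history


# With c = 1: tau_k = 1/(k+1) and gamma_{k+1} = gamma_1/(k+1), so all weights are equal.
gamma1 = 1.0
taus = [1.0 / (k + 1) for k in range(3)]
gammas = [gamma1 / (k + 1) for k in range(3)]
points = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
averages = weighted_average_recursive(points, gammas, taus)
```

Under the update rule with $c = 1$ the weights are constant, so the scheme reduces to a plain running mean of the dual maximizers; for $c > 1$ later iterates receive larger weights.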

We specify the convergence guarantee of Algorithm 1(c) in the following theorem. The proof of this theorem is given in Appendix A.5.

Theorem 2 Assume that $b_{\mathcal{U}}$ is chosen such that $L_b = 1$, and $c = 1$ in (21). Let $\{(x^k, \bar{u}^k)\}$ be generated by Algorithm 1(c). Then $\{\bar{u}^k\} \subset \mathcal{U}$ and


-||a:*||dist ib-Au^.K) < ip{u^) - ip* < ^ 

IIAI 


"+2(7l-^27^^)D^ 


dist {b—Au^, /C) < 


7l(fe+l) 

{\\x°^*\\ + y/\\xO-x*\\^+2\\A\\-^(2'il+li)Du'^ 
7l(fe+l) 


(37) 


Consequently, the worst-case iteration-complexity of Algorithm 1(c) to achieve an $\varepsilon$-solution $\bar{u}^k$ such that $|\varphi(\bar{u}^k) - \varphi^*| \le \varepsilon$ and $\mathrm{dist}(b - A\bar{u}^k, \mathcal{K}) \le \varepsilon$ is $O(1/\varepsilon)$.

Theorem 2 shows that Algorithm 1(c) has the $O(1/\varepsilon)$ worst-case iteration-complexity on the primal objective residual and feasibility violation for (33).


6 Preliminary numerical experiments

We demonstrate the performance of Algorithm 1 for solving three well-known convex optimization problems. The first example is a LASSO problem with $\ell_1$-loss [S3], the second one is a square-root LASSO studied in [S|, and the last example is an image deblurring problem with a non-smooth data fidelity function (e.g., the $\ell_1$-norm or the $\ell_2$-norm function).

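6.1 The $\ell_1$-$\ell_1$-regularized LASSO

We consider the $\ell_1$-$\ell_1$-regularized LASSO problem studied in [M] as follows:

$$F^* := \min \{ F(x) := \|Bx - b\|_1 + \lambda\|x\|_1 : x \in \mathbb{R}^p \}, \tag{38}$$

where $B$ and $b$ are defined as in (39), and $\lambda > 0$ is a regularization parameter.

The function $f(x) := \|Bx - b\|_1 = \max_u \{ \langle B^\top u, x \rangle - \langle b, u \rangle : \|u\|_\infty \le 1 \}$ falls into the decomposable case considered in Subsection 4.3. Hence, we can smooth $f$ using the quadratic prox-function to obtain

$$f_\gamma(x) := \max_{u} \{ \langle x, B^\top u \rangle - \langle b, u \rangle - (\gamma/2)\|u\|^2 : u \in \mathcal{B}_\infty \}.$$

Clearly, we can show that $u^*_\gamma(Bx) := \mathrm{proj}_{\mathcal{B}_\infty}\big( \gamma^{-1}(Bx - b) \big)$. In this case, we also have $D_{\mathcal{U}} := n/2$ and $\mathcal{U} := \mathcal{B}_\infty$.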

Now, we apply Algorithm 1 to solve problem (38). To verify the theoretical bound in Theorem 1, we use CVX [TH] to solve (38) and obtain a high-accuracy approximate solution $x^*$. Then, we can compute $R_0 := \|x^0 - x^*\|_2$, and choose $\gamma_1 \equiv \gamma_1^* := \frac{\|B\| R_0}{\sqrt{6 D_{\mathcal{U}}}}$. From Theorem 1 we have $F(x^k) - F^* \le \frac{R_0 \|B\| \sqrt{6 D_{\mathcal{U}}}}{k}$, which is the worst-case bound of Algorithm 1, where $k$ is the iteration number.

For our comparison, we also implement the smoothing algorithm in [24] using the quadratic prox-function. As indicated in (5), we set $\gamma$ at its optimal value $\gamma^*$. Hence, we also obtain the theoretical upper bound on $F(x^k) - F^*$ given by (5). We name this algorithm as Non-adapt. Alg. (non-adaptive algorithm).

The test data is generated as follows: Matrix $B \in \mathbb{R}^{n \times p}$ is generated randomly using the standard Gaussian distribution $\mathcal{N}(0, 1)$. We consider two cases. In the first case, we use non-correlated data, while in the second case, we generate $B$ with 50% correlated columns as $B(:, j+1) = 0.5 B(:, j) + \mathrm{randn}(:)$. The observed measurement vector $b$ is generated as $b := Bx^\natural + \mathcal{N}(0, 0.05)$, where $x^\natural$ is a given $s$-sparse vector generated randomly using $\mathcal{N}(0, 1)$.
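
The data-generation recipe above can be sketched as follows. Several details are not pinned down by the description and are treated here as assumptions: the value 0.05 is interpreted as the noise standard deviation, the correlation recursion is applied to every consecutive pair of columns, and the support of the sparse vector is drawn uniformly at random. The function name and the seed argument are illustrative.

```python
import random


def generate_data(n, p, s, correlated=False, noise_std=0.05, seed=0):
    """Generate (B, b, x_true) with B of size n x p and an s-sparse x_true."""
    rng = random.Random(seed)
    # Standard Gaussian design matrix, stored as a list of n rows.
    B = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    if correlated:
        # Assumed reading of B(:, j+1) = 0.5 * B(:, j) + randn(:), applied column by column.
        for j in range(p - 1):
            for i in range(n):
                B[i][j + 1] = 0.5 * B[i][j] + rng.gauss(0.0, 1.0)
    # s-sparse ground truth with Gaussian nonzero entries.
    support = rng.sample(range(p), s)
    x_true = [0.0] * p
    for j in support:
        x_true[j] = rng.gauss(0.0, 1.0)
    # Noisy measurements b = B x_true + noise.
    b = [sum(B[i][j] * x_true[j] for j in support) + rng.gauss(0.0, noise_std)
         for i in range(n)]
    return B, b, x_true


# Scaled-down instance; the experiments in the text use (p, n, s) = (1000, 350, 100).
B_demo, b_demo, x_demo = generate_data(n=35, p=100, s=10, correlated=True)
```

Fixing the seed makes the instance reproducible, which helps when comparing the adaptive and non-adaptive variants on identical data.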

We test both algorithms, Algorithm 1 and Non-adapt. Alg., on two problem instances of the size $(p, n, s) = (1000, 350, 100)$ (with and without correlated data, respectively). We sweep along the values of $\lambda$ to find an optimal value for $\lambda$, which is $\lambda = 6.2105$ for non-correlated data, and $\lambda = 5.7368$ for correlated data. For comparison, we first select the optimal values $\gamma_1 := \gamma_1^*$ and $\gamma := \gamma^*$ in both algorithms. Then, we consider two cases: (i) $\gamma_1 := 10\gamma_1^*$ and $\gamma := 10\gamma^*$, and (ii) $\gamma_1 := 0.1\gamma_1^*$ and $\gamma := 0.1\gamma^*$.

Fig. 1 The empirical performance vs. the theoretical bounds of the 6 algorithmic variants (Left: non-correlated data, Right: correlated data). Curves: Algorithm 1 (theoretical bound); Non-adapt. Alg. (theoretical bound); Algorithm 1 with $\gamma_1 = \gamma_1^*$, $10\gamma_1^*$, $0.1\gamma_1^*$; Non-adapt. Alg. with $\gamma = \gamma^*$, $10\gamma^*$, $0.1\gamma^*$. Horizontal axis: #iteration (×200).


Figure 1 plots the empirical bounds of the 6 variants vs. the theoretical bounds from 200 to 10,000 iterations. Both algorithms show an empirical rate which is much better than their theoretical bound. But if we change the smoothness parameters, the guarantee is no longer preserved. Algorithm 1 shows a better performance than Non-adapt. Alg. in both cases.

6.2 Square-root LASSO 

We consider the following well-known square-root LASSO problem: 

min |F(a;) := \\Bx - b \\2 + A||a:||i|. (39) 

As proved in [5], if matrix B is Gaussian, then we can select the regularization 
parameter A such that we can obtain exact recovery to the true solution x^. 
The function / defined by f{x) := \\Bx — h \\2 can be written as 

f(x) = max I {B^u, x) — {b, u) : ||m ||2 < l| ■ 


Let B 2 := {u S M" : ||w ||2 < 1} be the £ 2 -norm ball. We choose b{u) := ^llwHi 
as a prox-function for 82 - Then, we can smooth / using 5(-) := A|| • ||| as 


f~^{x) := max {{x,B^u) - {b,u) - ( 7 / 2 )||u ||2 ■ u G 82 } ■ 

U 

Clearly, u*{x) := projg^ {^~^{Bx — b)^ is the solution of the maximization 
problem, where projg^ is the projection onto 82 - Moreover, we have Dy = f. 

Now, we apply Algorithm 1 to solve problem (39). We choose $c := 1$ and set $\gamma_1 = \gamma^*$, the value computed from the worst-case bound with $R_0 := \|x^0 - x^*\|_2$. We also estimate the theoretical upper bound on $F(x^k) - F^*$ given by the convergence theorem for Algorithm 1. We implement the smoothing algorithm in [24] for our comparison by using the same prox-function. The parameter of this algorithm is set as in the previous example.
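
To make the procedure concrete, here is a rough sketch of an accelerated proximal-gradient loop with a decreasing smoothness parameter applied to the square-root LASSO. It is not a transcription of Algorithm 1: the update rules $\tau_k = 1/(k+c)$ and $\gamma_{k+1} = \gamma_1 c/(k+c)$, the step size $\gamma_{k+1}/L_B$, the three-sequence scheme, and the use of the squared Frobenius norm as the bound $L_B \ge \|B\|^2$ are all assumptions inferred from the appendix, and the function names are hypothetical.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise shrinkage)."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]


def objective(B, b, lam, x):
    """F(x) = ||Bx - b||_2 + lam * ||x||_1."""
    r = [sum(bij * xj for bij, xj in zip(row, x)) - bi for row, bi in zip(B, b)]
    return math.sqrt(sum(ri * ri for ri in r)) + lam * sum(abs(xi) for xi in x)


def adaptive_smoothing_sqrt_lasso(B, b, lam, gamma1, c=1.0, iters=200):
    """Sketch only; every update rule below is an assumption (see lead-in)."""
    n, p = len(B), len(B[0])
    L_B = sum(bij * bij for row in B for bij in row)   # ||B||_F^2 >= ||B||^2
    x = [0.0] * p
    x_tilde = list(x)
    for k in range(iters):
        tau = 1.0 / (k + c)                  # assumed tau_k
        gamma = gamma1 * c / (k + c)         # assumed gamma_{k+1}
        x_hat = [(1 - tau) * xi + tau * ti for xi, ti in zip(x, x_tilde)]
        # u*_gamma(x_hat) = proj onto the unit l2-ball of (B x_hat - b) / gamma
        r = [sum(bij * xj for bij, xj in zip(row, x_hat)) - bi
             for row, bi in zip(B, b)]
        u = [ri / gamma for ri in r]
        norm_u = math.sqrt(sum(ui * ui for ui in u))
        if norm_u > 1.0:
            u = [ui / norm_u for ui in u]
        # Gradient of the smoothed function: B^T u
        grad = [sum(B[i][j] * u[i] for i in range(n)) for j in range(p)]
        beta = gamma / L_B                   # assumed step size
        x_new = soft_threshold(
            [xh - beta * g for xh, g in zip(x_hat, grad)], beta * lam)
        x_tilde = [ti - (xh - xn) / tau
                   for ti, xh, xn in zip(x_tilde, x_hat, x_new)]
        x = x_new
    return x


# Hypothetical toy instance.
B = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, 2.0]
lam = 0.1
x_final = adaptive_smoothing_sqrt_lasso(B, b, lam, gamma1=1.0)
F_init = objective(B, b, lam, [0.0, 0.0])
F_final = objective(B, b, lam, x_final)
```

The point of the sketch is that the smoothness parameter is updated inside the loop at no extra cost per iteration: each pass uses one projection, one multiplication by $B$ and $B^{\top}$, and one soft-thresholding step.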

The test data are generated as in Subsection 6.1. We also perform the test on two problem instances of size $(p, n, s) = (1000, 350, 100)$: non-correlated data and correlated data. We choose the regularization parameter $\lambda$ as suggested in [8]. We use the same setting for the smoothness parameter $\gamma$ in both algorithms as in Subsection 6.1. In this case, the theoretical upper bound of Algorithm 1 depends on the log-term, which is scaled by the condition number of $BB^{\top}$, and is worse than that of the Non-adapt. Alg. variant.


Fig. 2 The empirical performance vs. the theoretical bounds of the 6 algorithmic variants (Left: non-correlated data, Right: 50%-correlated columns). Horizontal axis: #iteration (x200). Legend: Algorithm 1 (theoretical bound); Non-adapt. Alg. (theoretical bound); Algorithm 1 with $\gamma_1 = \gamma_1^*$, $10\gamma_1^*$ and $0.1\gamma_1^*$; Non-adapt. Alg. with $\gamma = \gamma^*$, $10\gamma^*$ and $0.1\gamma^*$.


Figure 2 plots the empirical performance of the 6 variants against the theoretical bounds from 200 to 10,000 iterations. Both algorithms exhibit an empirical rate that is much better than their theoretical bounds. Algorithm 1 performs better than the nonadaptive method in this example. We note that the theoretical bound of Algorithm 1 remains non-optimal, while it is optimal in the nonadaptive one.




Fig. 3 Comparison of Algorithm 1 and Boţ & Hendrich Alg. (Left: the objective values vs. the number of sweeping points, Right: convergence of the relative objective residual).


Finally, we compare Algorithm 1 with the variable smoothing algorithm in [11] (Boţ & Hendrich Alg.). While the first term $f(x) := \|Ax - b\|_2$ is smoothed as in Algorithm 1, we smooth the second term $g(x) := \lambda\|x\|_1$ as

\[ g_{\beta}(x) := \max_{v} \big\{ \langle x, v\rangle - (\beta/2)\|v\|_2^2 : \|v\|_{\infty} \le 1 \big\}. \]

Then, we update $\gamma_k$ and $\beta_k$ at each iteration $k$ by the rules suggested in [11], which are scaled by two appropriate constants $c_a$ and $c_b$, respectively.

We compare Algorithm 1 and Boţ & Hendrich Alg. on a problem instance of size $(p, n, s) = (1000, 350, 100)$, where the data is generated as in the previous tests. To find appropriate values of $c_a$ and $c_b$, we sweep $c_a \in [10, 5000]$ simultaneously with $c_b \in [0.001, 500]$. We obtain $c_a = 51$ and $c_b = 49$. For Algorithm 1, we consider two cases. In the first case, we set $\gamma_1 = \gamma^* = 129.5505$, computed from the worst-case bound, while in the second case, we also sweep $\gamma_1 \in [10, 1000]$ to find an appropriate value $\gamma_1 = 51$. The results of both algorithms are plotted in Figure 3 for 5000 iterations.
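
The smoothed $\ell_1$ term used for Boţ & Hendrich Alg. has a closed form: with the unit bound on $\|v\|_{\infty}$, the maximizer is the componentwise clipping of $x/\beta$ to $[-1, 1]$, and the value is a sum of Huber functions. The sketch below checks this numerically; the function names and the sample vector are hypothetical, and the factor $\lambda$ is left out.

```python
def smoothed_l1(x, beta):
    """Return (g_beta(x), v_star) for
    g_beta(x) = max_{||v||_inf <= 1} <x, v> - (beta / 2) * ||v||_2^2."""
    v = [max(-1.0, min(1.0, xi / beta)) for xi in x]   # componentwise clipping
    value = sum(xi * vi - 0.5 * beta * vi * vi for xi, vi in zip(x, v))
    return value, v


def huber_sum(x, beta):
    """Closed form of the same quantity: a sum of Huber functions."""
    return sum(
        xi * xi / (2 * beta) if abs(xi) <= beta else abs(xi) - beta / 2
        for xi in x
    )


x = [0.05, -2.0, 0.0, 0.7]
beta = 0.1
g_beta, v_star = smoothed_l1(x, beta)
l1_norm = sum(abs(xi) for xi in x)
```

The smoothed value stays within $\beta/2$ per coordinate below $\|x\|_1$, which is why the quality of the approximation is governed by how $\beta_k$ is driven to zero.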

Figure 3 (left) shows that the objective value produced by Algorithm 1 does not vary much when $\gamma_1 \in [10, 1000]$, while, in Boţ & Hendrich Alg., the objective value changes rapidly when we sweep over $c_a$ and $c_b$ simultaneously. Hence, it is unclear how to choose appropriate values for $c_a$ and $c_b$ without sweeping. Figure 3 (right) shows the convergence behavior of both algorithms. Without sweeping, Algorithm 1 has a good empirical convergence rate in the early iterations. With sweeping, both algorithms perform better in the later iterations. Algorithm 1 performs better than Boţ & Hendrich Alg.

6.3 Image deblurring with the $\ell_1$ or $\ell_2$-data fidelity function

We consider an image deblurring problem using the $\ell_a$-norm fidelity term as

\[ \min \big\{ F(X) := \|\mathcal{A}(X) - b\|_a + \lambda\|WX\|_1 : X \in \mathbb{R}^{m \times q} \big\}, \tag{40} \]

where $a \in \{1, 2\}$, $\mathcal{A}$ is a blurring kernel acting on $m \times q$ images (with $p = m \times q$), $b$ is an observed noisy image, $W$ is the orthogonal Haar wavelet transform with four levels, and $\lambda > 0$ is the regularization parameter.

We now apply Algorithm 1 (Alg. 1) to solve problem (40) and compare it with the nonadaptive variant (Nes. Alg.) and Boţ & Hendrich's algorithm (BH Alg.) in [11]. Since $\mathcal{A}$ is orthogonal, we can use the quadratic smoothing function $b_{\mathcal{U}}(X) := (1/2)\|X\|_F^2$. With this choice, the solution of the smoothed maximization problem, which is needed to evaluate the gradient, is $u^*_{\gamma}(X) = \mathrm{proj}_{\mathcal{B}}\big(\gamma^{-1}(\mathcal{A}(X) - b)\big)$, where $\mathrm{proj}_{\mathcal{B}}$ is the projection onto the dual norm ball of the $\ell_a$-norm.
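
The projection onto the dual norm ball is cheap in both cases considered here: for $a = 1$ the dual norm is the $\ell_{\infty}$-norm, so the projection clips each entry to $[-1, 1]$; for $a = 2$ the dual norm is again the $\ell_2$-norm, so the projection rescales the vector. A minimal sketch, assuming the residual $\mathcal{A}(X) - b$ is already available as a flat list (the function names are hypothetical):

```python
import math


def proj_dual_ball(v, a):
    """Project v onto the unit ball of the norm dual to the l_a-norm."""
    if a == 1:
        # Dual of l_1 is l_inf: clip each entry to [-1, 1].
        return [max(-1.0, min(1.0, vi)) for vi in v]
    if a == 2:
        # l_2 is self-dual: rescale if the norm exceeds 1.
        norm = math.sqrt(sum(vi * vi for vi in v))
        return list(v) if norm <= 1.0 else [vi / norm for vi in v]
    raise ValueError("a must be 1 or 2")


def u_star(residual, gamma, a):
    """u*_gamma = proj(residual / gamma), with residual = A(X) - b flattened."""
    return proj_dual_ball([ri / gamma for ri in residual], a)


residual = [3.0, -0.2, 4.0]
u1 = u_star(residual, 2.0, 1)   # clips [1.5, -0.1, 2.0]
u2 = u_star(residual, 2.0, 2)   # rescales [1.5, -0.1, 2.0] to unit l2-norm
```

Both projections cost a single pass over the residual, which keeps the per-iteration work dominated by the application of the blurring operator.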

We test the three algorithms on five images widely used in the literature: cameraman, barbara, lena, boat and house. The noisy images are generated as in [1]. Although we use the nonsmooth $\ell_a$-norm function with $a = 1$ or
a = 2, the regularization parameter A is set to A := 10“^ as suggested in [4], 
but it still provides the best recovery compared to other values in all 5 images. 

While we fix $\gamma_1 = 62$ in Algorithm 1, which is roughly computed from the worst-case bound, we sweep $\gamma$ and $c_a$ (see Subsection 6.2) in $[0.0001, 1000]$ to choose the best possible value for Nes. Alg. and BH Alg. in each image (with 300 iterations). We also set $c_b = c_a$ as suggested in [11]. For Nes. Alg., we have $\gamma = 1$ in the boat image, while in the other 4 images, $\gamma = 2.5$ is the best value. For BH Alg., we have $c_a = 0.005$ in the cameraman, barbara and boat images, and $c_a = 0.0025$ in the lena and house images. The PSNR (Peak Signal to Noise Ratio [1]) values of the 8 algorithmic variants are reported in Table 1.
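
For reference, PSNR can be computed as in the short sketch below. It uses the common definition $10\log_{10}(\mathrm{peak}^2/\mathrm{MSE})$; whether this matches the exact convention of [1] (in particular the peak value) is an assumption, and the function name and the tiny images are hypothetical.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images (nested lists).

    `peak` is the largest possible pixel value: 1.0 for images scaled to
    [0, 1], 255 for 8-bit images.
    """
    diffs = [
        (r - e) ** 2
        for row_r, row_e in zip(reference, estimate)
        for r, e in zip(row_r, row_e)
    ]
    mse = sum(diffs) / len(diffs)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


clean = [[0.0, 0.5], [1.0, 0.25]]
noisy = [[0.1, 0.4], [0.9, 0.35]]
value = psnr(clean, noisy)   # every pixel is off by 0.1, so MSE is about 0.01
```

A higher PSNR means a smaller mean squared error, so a gain of 1 dB corresponds to roughly a 20% reduction in MSE.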

Table 1 The PSNR values reported by the 8 algorithmic variants on the 5 test images

| Algorithmic variant                      | cameraman | barbara | lena    | boat    | house   |
|------------------------------------------|-----------|---------|---------|---------|---------|
| PSNR after 300 iterations                |           |         |         |         |         |
| Alg. 1 ($\ell_1$, $\gamma_1 = 62$)       | 26.2140   | 26.8253 | 27.1793 | 26.4951 | 30.9848 |
| Alg. 1 ($\ell_1$, $\gamma_1$-sweeping)   | 26.2693   | 27.0682 | 27.5440 | 26.5519 | 31.6877 |
| Alg. 1 ($\ell_2$, $\gamma_1 = 62$)       | 26.2128   | 26.8232 | 27.1782 | 26.4923 | 30.4126 |
| Alg. 1 ($\ell_2$, $\gamma_1$-sweeping)   | 26.2128   | 26.8232 | 27.1782 | 26.4923 | 30.4126 |
| Nes. Alg. ($\ell_1$, $\gamma$-sweeping)  | 25.0601   | 26.1376 | 26.3776 | 25.2301 | 30.2982 |
| Nes. Alg. ($\ell_2$, $\gamma$-sweeping)  | 25.0908   | 26.1361 | 26.3901 | 25.2364 | 30.4081 |
| BH Alg. ($\ell_1$, $c_a$-sweeping)       | 25.5784   | 26.3421 | 26.5916 | 25.6025 | 31.1606 |
| BH Alg. ($\ell_2$, $c_a$-sweeping)       | 25.4784   | 26.4421 | 26.5916 | 25.6025 | 31.1606 |
| PSNR after 500 iterations                |           |         |         |         |         |
| Alg. 1 ($\ell_1$, $\gamma_1 = 62$)       | 27.0371   | 27.6286 | 28.1471 | 27.3116 | 32.1771 |
| Alg. 1 ($\ell_1$, $\gamma_1$-sweeping)   | 27.1666   | 27.8449 | 28.2086 | 27.4410 | 32.8647 |
| Alg. 1 ($\ell_2$, $\gamma_1 = 62$)       | 27.0363   | 27.6279 | 28.1480 | 27.3111 | 32.1710 |
| Alg. 1 ($\ell_2$, $\gamma_1$-sweeping)   | 27.0363   | 27.6279 | 28.1486 | 27.3111 | 32.1710 |
| Nes. Alg. ($\ell_1$, $\gamma$-sweeping)  | 25.0857   | 26.1686 | 26.4590 | 26.1321 | 30.4720 |
| Nes. Alg. ($\ell_2$, $\gamma$-sweeping)  | 25.0845   | 26.1686 | 26.4582 | 25.2265 | 30.4718 |
| BH Alg. ($\ell_1$, $c_a$-sweeping)       | 26.5030   | 27.1588 | 27.1630 | 27.0277 | 31.8824 |
| BH Alg. ($\ell_2$, $c_a$-sweeping)       | 26.5030   | 27.1588 | 27.1630 | 27.0277 | 31.8824 |
| PSNR after 1000 iterations               |           |         |         |         |         |
| Alg. 1 ($\ell_1$, $\gamma_1 = 62$)       | 27.4774   | 27.8353 | 28.4224 | 27.6596 | 32.9985 |
| Alg. 1 ($\ell_1$, $\gamma_1$-sweeping)   | 27.3291   | 27.8659 | 28.4040 | 27.9482 | 33.2038 |
| Alg. 1 ($\ell_2$, $\gamma_1 = 62$)       | 27.2524   | 27.8070 | 28.4774 | 27.5268 | 33.1879 |
| Alg. 1 ($\ell_2$, $\gamma_1$-sweeping)   | 27.2524   | 27.8070 | 28.4771 | 27.5268 | 33.1879 |
| Nes. Alg. ($\ell_1$, $\gamma$-sweeping)  | 25.0870   | 26.1691 | 26.4602 | 26.1371 | 30.4698 |
| Nes. Alg. ($\ell_2$, $\gamma$-sweeping)  | 25.0867   | 26.1690 | 26.4600 | 25.2267 | 30.4700 |
| BH Alg. ($\ell_1$, $c_a$-sweeping)       | 27.1128   | 27.8391 | 27.9327 | 27.3487 | 32.6715 |
| BH Alg. ($\ell_2$, $c_a$-sweeping)       | 27.1723   | 27.8205 | 27.9327 | 27.3143 | 32.6715 |

It shows that the nonsmooth $\ell_1$-norm objective produces slightly better recovery images in terms of PSNR than the $\ell_2$-norm objective in many cases for Algorithm 1, but this is not the case for Nes. Alg. and BH Alg. In addition, Algorithm 1 is superior to Nes. Alg. in all cases, and is also better than BH Alg. in the majority of the tests. We note that the complexity-per-iteration of the four algorithms is essentially the same, while our new adaptive strategy produces better solutions in terms of PSNR than the other two methods. In addition, our algorithm significantly improves the PSNR if we run it further, while the nonadaptive variant does not make any clear progress on the PSNR value if we continue running it. If we sweep the values of $\gamma_1$ in Algorithm 1 ($\gamma_1$-sweeping), we can also improve the results of this algorithm.

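
The remark about running the methods further can be read directly off Table 1. The snippet below copies the cameraman entries for three $\ell_1$ variants from the table and computes the PSNR gained between 300 and 1000 iterations; the dictionary layout and the labels are illustrative only.

```python
# PSNR values (dB) copied from Table 1 for the cameraman image.
psnr_cameraman = {
    "Alg. 1 (l1, gamma1 = 62)": {300: 26.2140, 500: 27.0371, 1000: 27.4774},
    "Nes. Alg. (l1, gamma-sweeping)": {300: 25.0601, 500: 25.0857, 1000: 25.0870},
    "BH Alg. (l1, ca-sweeping)": {300: 25.5784, 500: 26.5030, 1000: 27.1128},
}

# Gain obtained by continuing from 300 to 1000 iterations.
gains = {
    name: round(values[1000] - values[300], 4)
    for name, values in psnr_cameraman.items()
}
```

On this image the nonadaptive variant gains less than 0.03 dB over the extra 700 iterations, whereas the two methods with a varying smoothness parameter gain more than 1 dB.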
Acknowledgements This research was supported by NSF, Grant No. IPF 16-4829. 


A Appendix: The proof of technical results

This appendix provides the full proof of the technical results presented in the main text. 

A.1 The proof of Lemma Descent property of the proximal gradient step

By using with f~/{x) := x), = A'Wip^{A^x), z := A^x, z := A^x, and 

||AT(x — x)|| < ||A||||fr — ic|| we can show that 

|||u*(A^a;) - u*^{A^x)\\'^ < f^{x) - fy(x) - {\7f^{x),x- x) < - £||2. 


Using this estimate, we can show that the proof of can be done similarly as in [31]. □ 





























A.2 The proof of Lemma Key estimate 

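We first substitute the chosen value of $\beta$ into (14) and use (15) to obtain

\[ F_{\gamma_{k+1}}(x^{k+1}) \le \hat{\ell}^k_{\gamma_{k+1}}(x) + \frac{\|A\|^2}{\gamma_{k+1}}\langle x^{k+1} - \hat{x}^k, x - \hat{x}^k\rangle - \frac{\|A\|^2}{2\gamma_{k+1}}\|\hat{x}^k - x^{k+1}\|^2. \]

Applying this inequality at $x^k$ and at $x$, multiplying the results by $(1-\tau_k)$ and by $\tau_k$, respectively, and summing up, we obtain

\[ F_{\gamma_{k+1}}(x^{k+1}) \le (1-\tau_k)\hat{\ell}^k_{\gamma_{k+1}}(x^k) + \tau_k\hat{\ell}^k_{\gamma_{k+1}}(x) + \frac{\|A\|^2}{\gamma_{k+1}}\langle \hat{x}^k - x^{k+1}, \hat{x}^k - (1-\tau_k)x^k - \tau_k x\rangle - \frac{\|A\|^2}{2\gamma_{k+1}}\|\hat{x}^k - x^{k+1}\|^2, \]

where $\hat{\ell}^k_{\gamma}(x) := f_{\gamma}(\hat{x}^k) + \langle\nabla f_{\gamma}(\hat{x}^k), x - \hat{x}^k\rangle + g(x)$.

From (16) we have $\tau_k\tilde{x}^k = \hat{x}^k - (1-\tau_k)x^k$, so we can write this inequality as

\[ F_{\gamma_{k+1}}(x^{k+1}) \le (1-\tau_k)\hat{\ell}^k_{\gamma_{k+1}}(x^k) + \tau_k\hat{\ell}^k_{\gamma_{k+1}}(x) + \frac{\|A\|^2\tau_k^2}{2\gamma_{k+1}}\Big[\|\tilde{x}^k - x\|^2 - \big\|\tilde{x}^k - \tfrac{1}{\tau_k}(\hat{x}^k - x^{k+1}) - x\big\|^2\Big]. \tag{41} \]

Using (10) with $\gamma := \gamma_{k+1}$, $\bar{\gamma} := \gamma_k$ and $x := x^k$, we get

\[ f_{\gamma_{k+1}}(x^k) \le f_{\gamma_k}(x^k) + (\gamma_k - \gamma_{k+1})\, b_{\mathcal{U}}(u^*_{\gamma_{k+1}}(A^{\top}x^k)), \]

which leads to (since $F_{\gamma} = f_{\gamma} + g$)

\[ F_{\gamma_{k+1}}(x^k) \le F_{\gamma_k}(x^k) + (\gamma_k - \gamma_{k+1})\, b_{\mathcal{U}}(u^*_{\gamma_{k+1}}(A^{\top}x^k)). \tag{42} \]

Next, we estimate $\hat{\ell}^k_{\gamma_{k+1}}(x)$. Using the definition of $f_{\gamma}$ and $\nabla f_{\gamma}$, and writing $\hat{u}^k := u^*_{\gamma_{k+1}}(A^{\top}\hat{x}^k)$ for brevity, we can deduce

\[
\begin{aligned}
\hat{\ell}^k_{\gamma_{k+1}}(x) &:= f_{\gamma_{k+1}}(\hat{x}^k) + \langle\nabla f_{\gamma_{k+1}}(\hat{x}^k), x - \hat{x}^k\rangle + g(x) \\
&= \langle\hat{x}^k, A\hat{u}^k\rangle - \varphi(\hat{u}^k) - \gamma_{k+1} b_{\mathcal{U}}(\hat{u}^k) + \langle x - \hat{x}^k, A\hat{u}^k\rangle + g(x) \\
&= \langle x, A\hat{u}^k\rangle - \varphi(\hat{u}^k) - \gamma_{k+1} b_{\mathcal{U}}(\hat{u}^k) + g(x) \\
&\le \max\{\langle x, Au\rangle - \varphi(u) : u \in \mathcal{U}\} - \gamma_{k+1} b_{\mathcal{U}}(\hat{u}^k) + g(x) \\
&= F(x) - \gamma_{k+1} b_{\mathcal{U}}(u^*_{\gamma_{k+1}}(A^{\top}\hat{x}^k)).
\end{aligned}
\tag{43}
\]

Substituting $\tilde{x}^{k+1} := \tilde{x}^k - \frac{1}{\tau_k}(\hat{x}^k - x^{k+1})$ from the third line of (16), together with (42) and (43), into (41), we can derive

\[ F_{\gamma_{k+1}}(x^{k+1}) \le (1-\tau_k)F_{\gamma_k}(x^k) + \tau_k F(x) + \frac{\|A\|^2\tau_k^2}{2\gamma_{k+1}}\big[\|\tilde{x}^k - x\|^2 - \|\tilde{x}^{k+1} - x\|^2\big] - R_k, \]

which is indeed (18), where $R_k$ is given by (19).

Finally, we prove (20). Indeed, using the strong convexity and the $L_b$-smoothness of $b_{\mathcal{U}}$, we can lower bound

\[ R_k \ge \frac{\tau_k\gamma_{k+1}}{2}\|u^*_{\gamma_{k+1}}(A^{\top}\hat{x}^k) - \bar{u}^c\|^2 + \frac{(1-\tau_k)\gamma_{k+1}}{2}\|u^*_{\gamma_{k+1}}(A^{\top}\hat{x}^k) - u^*_{\gamma_{k+1}}(A^{\top}x^k)\|^2 - \frac{L_b}{2}(1-\tau_k)(\gamma_k - \gamma_{k+1})\|u^*_{\gamma_{k+1}}(A^{\top}x^k) - \bar{u}^c\|^2. \]

Letting $\hat{v}_k := u^*_{\gamma_{k+1}}(A^{\top}\hat{x}^k) - \bar{u}^c$ and $v_k := u^*_{\gamma_{k+1}}(A^{\top}x^k) - \bar{u}^c$, we write this bound on $R_k$ as

\[
\begin{aligned}
2\gamma_{k+1}^{-1}R_k &\ge \tau_k\|\hat{v}_k\|^2 + (1-\tau_k)\|\hat{v}_k - v_k\|^2 - (1-\tau_k)(\gamma_{k+1}^{-1}\gamma_k - 1)L_b\|v_k\|^2 \\
&= \|\hat{v}_k\|^2 - 2(1-\tau_k)\langle\hat{v}_k, v_k\rangle + (1-\tau_k)\big[1 - (\gamma_{k+1}^{-1}\gamma_k - 1)L_b\big]\|v_k\|^2 \\
&= \|\hat{v}_k - (1-\tau_k)v_k\|^2 + (1-\tau_k)\big[\tau_k - (\gamma_{k+1}^{-1}\gamma_k - 1)L_b\big]\|v_k\|^2 \\
&\ge (1-\tau_k)\big[\tau_k - (\gamma_{k+1}^{-1}\gamma_k - 1)L_b\big]\|v_k\|^2,
\end{aligned}
\]

which obviously implies (20).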
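
The last chain of equalities rests on the identity $\tau\|\hat{v}\|^2 + (1-\tau)\|\hat{v} - v\|^2 = \|\hat{v} - (1-\tau)v\|^2 + \tau(1-\tau)\|v\|^2$. The following snippet is a quick numerical sanity check of that identity on random vectors; it is only an illustration and the helper names are hypothetical.

```python
import random


def dot(a, b):
    """Euclidean inner product of two lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def lhs(v_hat, v, tau):
    """tau * ||v_hat||^2 + (1 - tau) * ||v_hat - v||^2."""
    d = [p - q for p, q in zip(v_hat, v)]
    return tau * dot(v_hat, v_hat) + (1 - tau) * dot(d, d)


def rhs(v_hat, v, tau):
    """||v_hat - (1 - tau) * v||^2 + tau * (1 - tau) * ||v||^2."""
    d = [p - (1 - tau) * q for p, q in zip(v_hat, v)]
    return dot(d, d) + tau * (1 - tau) * dot(v, v)


rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    v_hat = [rng.uniform(-1, 1) for _ in range(5)]
    v = [rng.uniform(-1, 1) for _ in range(5)]
    tau = rng.uniform(0, 1)
    max_gap = max(max_gap, abs(lhs(v_hat, v, tau) - rhs(v_hat, v, tau)))
```

Dropping the nonnegative square $\|\hat{v} - (1-\tau)v\|^2$ then gives the final inequality.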


A.3 The proof of Lemma The choice of parameters 

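First, using the update rules of $\tau_k$ and $\gamma_k$, we can express the quantity $m_k$ as

\[ m_k := \frac{\gamma_{k+1}(1-\tau_k)\big[\gamma_{k+1}\tau_k - L_b(\gamma_k - \gamma_{k+1})\big]}{\tau_k^2} = -\frac{\gamma_1^2c^2\big[(L_b - 1)(k+c) + 1\big]}{(k+c)^2}. \]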

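Moreover, it follows from the properties of $b_{\mathcal{U}}$ that

\[ \tfrac{1}{2}\|u^*_{\gamma_{k+1}}(A^{\top}x^k) - \bar{u}^c\|^2 \le b_{\mathcal{U}}(u^*_{\gamma_{k+1}}(A^{\top}x^k)) \le D_{\mathcal{U}}. \]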


Multiplying the lower bound ([20| by and combining the result with the last inequality 

and the estimate of $m_k$, we obtain the first lower bound in (22).

Next, using the update rules of $\tau_k$ and $\gamma_k$, we have

\[ \frac{(1-\tau_k)\gamma_{k+1}}{\tau_k^2} = \frac{(k+c-1)(k+c)^2\gamma_1c}{(k+c)(k+c)} = \frac{\gamma_1c(k+c-1)^2}{k+c-1} = \frac{\gamma_k}{\tau_{k-1}^2}, \]

which is the second equality in (22).
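
The second equality in (22) can be verified in exact arithmetic. The check below assumes the update rules $\tau_k = 1/(k+c)$ and $\gamma_k = \gamma_1 c/(k+c-1)$, which are the rules consistent with the displayed identity; the function names and the sample values of $c$ and $\gamma_1$ are hypothetical.

```python
from fractions import Fraction


def tau(k, c):
    """Assumed step parameter tau_k = 1 / (k + c)."""
    return Fraction(1) / (k + c)


def gamma(k, c, gamma1):
    """Assumed smoothness parameter gamma_k = gamma1 * c / (k + c - 1), k >= 1."""
    return gamma1 * c / (k + c - 1)


c, gamma1 = Fraction(3, 2), Fraction(62)

# Check (1 - tau_k) * gamma_{k+1} / tau_k^2 == gamma_k / tau_{k-1}^2 exactly.
identity_holds = all(
    (1 - tau(k, c)) * gamma(k + 1, c, gamma1) / tau(k, c) ** 2
    == gamma(k, c, gamma1) / tau(k - 1, c) ** 2
    == gamma1 * c * (k + c - 1)
    for k in range(1, 200)
)
```

Both sides equal $\gamma_1 c\,(k+c-1)$, which is what makes the recursion in the next step telescope.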


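Using $x = x^*$ in (18) together with the lower bound of $R_k$ from (22), we have

\[ \frac{\gamma_{k+1}}{\tau_k^2}\Delta F_{k+1} + \frac{\|A\|^2}{2}\|\tilde{x}^{k+1} - x^*\|^2 \le \frac{(1-\tau_k)\gamma_{k+1}}{\tau_k^2}\Delta F_k + \frac{\|A\|^2}{2}\|\tilde{x}^k - x^*\|^2 + s_kD_{\mathcal{U}}, \tag{44} \]

where $\Delta F_k := F_{\gamma_k}(x^k) - F^*$ and $s_k$ is the coefficient of $D_{\mathcal{U}}$ produced by the first lower bound in (22). Using this inequality and the relation $\frac{(1-\tau_k)\gamma_{k+1}}{\tau_k^2} = \frac{\gamma_k}{\tau_{k-1}^2}$ in (22), we can easily show that

\[ \frac{\gamma_{k+1}}{\tau_k^2}\Delta F_{k+1} + \frac{\|A\|^2}{2}\|\tilde{x}^{k+1} - x^*\|^2 \le \frac{\gamma_k}{\tau_{k-1}^2}\Delta F_k + \frac{\|A\|^2}{2}\|\tilde{x}^k - x^*\|^2 + s_kD_{\mathcal{U}}. \]

By induction, we obtain from the last inequality that

\[ \frac{\gamma_{k+1}}{\tau_k^2}\Delta F_{k+1} + \frac{\|A\|^2}{2}\|\tilde{x}^{k+1} - x^*\|^2 \le \frac{(1-\tau_0)\gamma_1}{\tau_0^2}\Delta F_0 + \frac{\|A\|^2}{2}\|\tilde{x}^0 - x^*\|^2 + S_kD_{\mathcal{U}}, \tag{45} \]

which implies (23), where $S_k := \sum_{i=0}^k s_i$.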

Finally, to prove ( |24| l, we use two elementary inequalities T ^ \n(k + c) and 

E k 1 -CT" 1 _L 

■i=0 (i+c)'^ — 2-^i=l 


{i+c-l)(i + c) c‘ 


^ 1 4- 1 


A.4 The proof of Corollary The smooth accelerated gradient method 

First, similarly to the proof of (44), we can derive


7fcfl _ 'Xk)'yhi-1 _ ^p <}_\\^k_ * n2 i ^ n 


( L 97 fe+l + || A || 2 ) r ,^ ^"■^"'2 


where AFk ■= FjAx'^) - F*, and Sk := ^^(1 xk)[Lt('yk 7|h-i) 

^ rliLg'ik + l + WAW^) 
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Next, we impose condition — '^fc) 7 fc+i _ ^— ,, ^ and choose 

’ ^ T^(I-57fc + l + l|Ap) rfZjLglkWW) ^ ^ + 1 


fe7fe l|A||^ 


fc + 1 • 
. Now, we show 


Then, we can show from the last condition that 7 ^ 4-1 = t- , . .. 9 ,, , ^, 

’ Lg-fk + \\Ap(k+l) 

that 7fc < ^- Indeed, we have ;^ + > (Tt) 7 , which implies 

that 7 fc+i < By induction, we get 7 fc+i < On the other hand, assume that 

—-— = ( ——h II , < — (for k > 2 . This condition leads to 7 ^ < rr. 

7fc + i V ^ ^ ||A|pfc - 7fc V^-ly - ’ Lg{k-1) 

Using 7 a; < ^ = urV choice of 71 , we can show that 7 ^ < ^ (fc— 1 ) • Bence, 

with the choice 71 := the estimate —-— < — and the update rule of 71 . 

eventually imply 


7l||^lP =71 < < 'I'l 

(L 971 + 2 ||A|| 2 )fc k - k + 1’ 


Vfc > 1. 


This condition leads to 


i‘fc_l(-t'g7fc + ||A|| ) 


= LgT^_^ 


-||^lP< 


3Lg(k-l) _ ZLg 

P “ k 


Using the estimates of and 7 /,., we can easily show that < ]^^k\k+ 2 ) ^ ■L^(fc+U(fc+ 2 ) 

n\ • Hence, we can show that 
Lg(A: + l}(fc + 2) ’ 

. ^ 2 L,||A ||2 yL,-l)||A||2^ 1 2 L,||A ||2 , (L 6 - 1 )||A ||2 

S “^1“ " S ^ ■ 

Using this estimate, we can show that 

^ ^*^3Ls„__o „*„2 , 2L6||A||2^ , r. 

)-F < —\\x -X II H-— Du-\ --(ln(A:) + 


Finally, using the bound | |ll[ | and jk ^ (fc+i) ^ zTfc ’ obtain ( |32| l. □ 

A.5 The proof of Theorem Primal solution recovery 

Let AFk ■.= (x^) - F*. Then, by we have AFk > F{x^) - F* - 'jkDu > -^yk^u- 

Similar to the proof of Lemma we can prove that 


< -^AF, + 

i-l 


7^1 


(x) + 


ll^ll' 


||x* - a;|p - ||x‘+l _ x\\'^) + SiDu, (46) 


where Si := as in the proof of Lemmaj^ and (x) = (x, — 

F{u*^yxf‘))+g{x)-F* = {x,Au*^^{x^)-b)-ip{u*^^{x'’)) + sic{x)-F*. Summing up 
this inequality from i = 1 to i = k and using tq = 1 and x^ = x^, we obtain 


"Efc 


7i+l 


l|A|| = 


X^ —xll^ — II X 


+-yiAFi + SkF>u, (47) 


where Sk '■= Now, using again l |46[ l with fc = 1, = x^* and tq = 1, we get 

7 iAFi < 7 iToAt^j^(x) + (II®*' — x*|p — ||xE — x*|p). Using this into ||47[l, one yields 


k 

< E — ((®, An;,+,(®”) - b)-¥^«^7x*)) + sk(x) - F*) 
i=0 

||x° - x||^ + Sf,Du- 


I|4I|P 


(48) 
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Combining defined by l |35| l with Wi \= and the convexity of we have 

k 

7^ ((x, - b) < Tfc [{x,Au'‘ - b) - (/?(«'“)) . 


Substituting this into | |48^ and then using > —^k+iDjj we get 

II^IP 


7^+1 


Du < n ((x,Au'= -b)- ^{u>=) + SK.{x)-F*)+'^\\x°-xf + SkDu, 


which implies 


F* < {x, Au’^ -b)- ¥.(«'=) + SK.{x) + ||x 0 - xf + 

fc 1 fc 


By arranging this inequality, we get 


k I ,^f^.k\ ^ t7* I II^II^II^O ^||2 1 ^V{ ( ^fc+l 


inf (x, b — Au^ — r) + < —F* + ^ — a^||^ + 

r£lC F]^ 


■Sk 


(49) 


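where we use the relation $-s_{\mathcal{K}}(x) = -\sup_{r\in\mathcal{K}}\langle x, r\rangle = \inf_{r\in\mathcal{K}}\langle x, -r\rangle$. On the other hand, by the saddle point theory for the primal and dual problems (33) and (34), for any optimal solution $x^*$, we can show that

\[ -F^* = \varphi^* \le \varphi(u) - \langle x^*, Au - b + r\rangle, \quad \forall u \in \mathcal{U},\ r \in \mathcal{K}. \]

Since this inequality holds for any $r \in \mathcal{K}$ and $u \in \mathcal{U}$, by using $u = \bar{u}^k$, it leads to

\[ \inf_{r\in\mathcal{K}}\langle x^*, A\bar{u}^k - b + r\rangle - \varphi(\bar{u}^k) \le F^*. \tag{50} \]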


Combining (49) and (50) yields


mm 

r^K 


I (x* — X, r + Au*^ — b) — 4rJ“ < -p- ( + Sj; ) , Vx g (51) 

t 21'jj0 ^ Fp. \ Xi j 


Taking x := x° — ||A|| '^F^.{Au^ — b + r) for any r G AC, we obtain from | |51| l that 

which implies (by the Cauchy-Schwarz inequality) 


min(rfe||Au'' -b + rf - 2\\Af\\Au'° - 6 + r||||x* - x°||| < 
rGK. 1 J 


,o„i ^ nApDu ryUi 


Fk 


+ ‘S'A; I . 


By elementary calculations and dist (6 — Au^,/C) = min {|| — 6 + r|| : r G/C}, we can 

show from the last inequality that 


dist (j) — Au^,JC^ < —a:*|| + ^ 


2 r . / 2 

x^ — x*\\‘^ ^ - 


[Sk 


7lfi 


Di 


(52) 


(53) 


II^IP 

To prove the first estimate of (37), we use (49) with $x = 0^p$ and $F^* = -\varphi^*$ to get

- V* < + sPjDu]. 

k '^k 

Since we apply Algorithmic) to solve the dual problem ( |34[ | using such that = 1, we 
have < 2'yf. Then, by using 7 fc+i = '^k •= can show that = 71 c. 

Moreover, we also have F^ := Y2i^o — 7 ic(fc + l). Using these estimates, and < 27 ^ 
from int o (|52| l and ( |53| l we obtain ( |37[ ). For the left-hand side inequality in the first 
estimate of ( |37[ l, we use a simple bound —||x*||dist (6 — Aw, K) < ip{u) — (p* for u = £U 
from the saddle point theory as in (|50[l. □ 